Goal:

Code basic models in R

Requirements:

  1. R or Rstudio
  2. Packages:

Input Data:

Users can load an Rdata frame which contains three R objects:

  1. finalmetab: a matrix containing metabolite abundance values.
  2. finalgenes: a matrix containing gene abundance values
  3. finannot: annotation (meta-data) of the samples where the column “cell_line” matches that of the metab matrix.

You can also find all the data in an Excel sheet.

Tips:

Load Data and Take a Look

load("/Users/math90/Documents/Teaching/Ethiopia/OHSI_DataAnalytics_Aug2019/Data/NCI60/CleanGeneMetab_NCI_60.Rdata")

# Look at what objects are in your data:
ls()
## [1] "finalgenes" "finalmetab" "finannot"   "smallmetab"
# Look at the dimensions of each:
dim(finalgenes)
## [1]    57 17987
dim(finalmetab)
## [1]  57 349
dim(finannot)
## [1] 57  4
dim(smallmetab)
## [1] 11  8

What information is available for each cell line? What is the summary for each type of variable? What is the size of each data frame? What’s in the rows and columns?

head(finannot)
##   cell_line   drugscore   drugcateg cancertype
## 1     786-0  0.05153939 No Response      Renal
## 2      A498 -0.45286203   Resistant      Renal
## 3 A549/ATCC -0.41449565   Resistant      NSCLC
## 4      ACHN -0.03506064 No Response      Renal
## 5    BT-549 -0.34282314   Resistant     Breast
## 6    CAKI-1 -0.03703719 No Response      Renal
summary(finannot)
##      cell_line    drugscore              drugcateg   cancertype       
##  786-0    : 1   Min.   :-1.09231   No Response:30   Length:57         
##  A498     : 1   1st Qu.:-0.30818   Resistant  :16   Class :character  
##  A549/ATCC: 1   Median :-0.05460   Sensitive  :11   Mode  :character  
##  ACHN     : 1   Mean   :-0.05754                                      
##  BT-549   : 1   3rd Qu.: 0.14492                                      
##  CAKI-1   : 1   Max.   : 0.81607                                      
##  (Other)  :51

Double check (you should always do this!) that the IDs in your sample meta data (finannot) match those of the abundance dta (finalgenes, finalmetab)

all.equal(rownames(finalgenes), finannot$cell_line)
## [1] "Modes: character, numeric"                      
## [2] "Attributes: < target is NULL, current is list >"
## [3] "target is character, current is factor"
# so that didn't work because finalgenes is a matrix and finannot is a data
# frame
all.equal(rownames(finalgenes), as.character(finannot$cell_line))
## [1] TRUE
all.equal(rownames(finalmetab), as.character(finannot$cell_line))
## [1] TRUE

Should we log our data? It’s best to have our samples as columns for much of the plotting so let’s take the transpose first.

set.seed(1)
finalmetab <- t(finalmetab)
finalgenes <- t(finalgenes)

par(mfrow = c(2, 2))
boxplot(as.data.frame(finalmetab), main = "Histogram of Metabolomic Data")
boxplot(as.data.frame(log2(finalmetab)), main = "Histogram of LOG Metabolomic Data")
boxplot(as.data.frame(finalgenes), main = "Histogram of Gene Expression Data")
boxplot(as.data.frame(log2(finalgenes)), main = "Histogram of LOG Gene Expression Data")

finalgenes <- log2(finalgenes)
finalmetab <- log2(finalmetab)

You should be familiar with this data, if not, look at the file “NCI60_HandsOn.html”

Running a linear model, looking for the association between a gene and drug score.

Let’s try this on one gene, let’s randomly pick a gene, and run a linear model to look at the association between the drug score and that gene.

First, let’s pick a gene and look at its distribution.

set.seed(1)
gene.index <- sample(1:ncol(finalgenes), 1)
mygene <- finalgenes[gene.index, ]
boxplot(mygene, main = rownames(finalgenes)[gene.index])

Now that we have a gene, let’s run a linear model to look at the association between that gene and drug response score.

mylm <- glm(mygene ~ finannot$drugscore, family = "gaussian")
mylm
## 
## Call:  glm(formula = mygene ~ finannot$drugscore, family = "gaussian")
## 
## Coefficients:
##        (Intercept)  finannot$drugscore  
##            11.5807              0.4104  
## 
## Degrees of Freedom: 56 Total (i.e. Null);  55 Residual
## Null Deviance:       10.79 
## Residual Deviance: 8.978     AIC: 62.41
summary(mylm)
## 
## Call:
## glm(formula = mygene ~ finannot$drugscore, family = "gaussian")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.77617  -0.30158  -0.03866   0.26774   0.82468  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        11.58070    0.05398 214.524  < 2e-16 ***
## finannot$drugscore  0.41039    0.12321   3.331  0.00155 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1632434)
## 
##     Null deviance: 10.7895  on 56  degrees of freedom
## Residual deviance:  8.9784  on 55  degrees of freedom
## AIC: 62.41
## 
## Number of Fisher Scoring iterations: 2
plot(mylm)

# Code to get coefficients:
summary(mylm)$coefficients[, "Estimate"]
##        (Intercept) finannot$drugscore 
##         11.5806979          0.4103902
summary(mylm)$coefficients["finannot$drugscore", "Estimate"]
## [1] 0.4103902
# Code to get p-values:
summary(mylm)$coefficients[, "Pr(>|t|)"]
##        (Intercept) finannot$drugscore 
##       4.411216e-82       1.551540e-03
summary(mylm)$coefficients["finannot$drugscore", "Pr(>|t|)"]
## [1] 0.00155154

Note that linear models are not reversible! If you switch the outcome and the covariables, you will not get the same answer!!

Let’s take a look:

flipmylm <- glm(finannot$drugscore ~ mygene, family = "gaussian")
flipmylm
## 
## Call:  glm(formula = finannot$drugscore ~ mygene, family = "gaussian")
## 
## Coefficients:
## (Intercept)       mygene  
##      -4.785        0.409  
## 
## Degrees of Freedom: 56 Total (i.e. Null);  55 Residual
## Null Deviance:       10.75 
## Residual Deviance: 8.948     AIC: 62.22
summary(flipmylm)$coefficients
##               Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) -4.7846528  1.4201953 -3.36901 0.001384099
## mygene       0.4090227  0.1227983  3.33085 0.001551540
summary(mylm)$coefficients
##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        11.5806979  0.0539832 214.52411 4.411216e-82
## finannot$drugscore  0.4103902  0.1232089   3.33085 1.551540e-03

Are there any potential confounders? Let’s go ahead and take those into account.

mylm2 <- glm(mygene ~ finannot$drugscore + finannot$cancertype, family = "gaussian")
mylm2
## 
## Call:  glm(formula = mygene ~ finannot$drugscore + finannot$cancertype, 
##     family = "gaussian")
## 
## Coefficients:
##                 (Intercept)           finannot$drugscore  
##                     11.4051                       0.3312  
##      finannot$cancertypeCNS     finannot$cancertypeColon  
##                      0.2645                       0.4167  
## finannot$cancertypeLeukemia  finannot$cancertypeMelanoma  
##                      0.1420                       0.4667  
##    finannot$cancertypeNSCLC   finannot$cancertypeOvarian  
##                      0.1691                       0.1480  
## finannot$cancertypeProstate     finannot$cancertypeRenal  
##                     -0.1222                      -0.2361  
## 
## Degrees of Freedom: 56 Total (i.e. Null);  47 Residual
## Null Deviance:       10.79 
## Residual Deviance: 6.385     AIC: 58.98
summary(mylm2)
## 
## Call:
## glm(formula = mygene ~ finannot$drugscore + finannot$cancertype, 
##     family = "gaussian")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6162  -0.2227  -0.0238   0.2517   0.6944  
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  11.4051     0.1651  69.071   <2e-16 ***
## finannot$drugscore            0.3312     0.1485   2.230   0.0305 *  
## finannot$cancertypeCNS        0.2645     0.2235   1.183   0.2427    
## finannot$cancertypeColon      0.4167     0.2173   1.918   0.0613 .  
## finannot$cancertypeLeukemia   0.1420     0.2483   0.572   0.5701    
## finannot$cancertypeMelanoma   0.4667     0.2102   2.220   0.0313 *  
## finannot$cancertypeNSCLC      0.1691     0.2068   0.818   0.4178    
## finannot$cancertypeOvarian    0.1480     0.2205   0.671   0.5054    
## finannot$cancertypeProstate  -0.1222     0.3101  -0.394   0.6954    
## finannot$cancertypeRenal     -0.2361     0.2170  -1.088   0.2822    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1358458)
## 
##     Null deviance: 10.7895  on 56  degrees of freedom
## Residual deviance:  6.3848  on 47  degrees of freedom
## AIC: 58.978
## 
## Number of Fisher Scoring iterations: 2
plot(mylm2)

# Code to get coefficients:
summary(mylm2)$coefficients[, "Estimate"]
##                 (Intercept)          finannot$drugscore 
##                  11.4050986                   0.3312194 
##      finannot$cancertypeCNS    finannot$cancertypeColon 
##                   0.2644965                   0.4166773 
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma 
##                   0.1419963                   0.4666523 
##    finannot$cancertypeNSCLC  finannot$cancertypeOvarian 
##                   0.1690627                   0.1479707 
## finannot$cancertypeProstate    finannot$cancertypeRenal 
##                  -0.1221830                  -0.2360613
summary(mylm2)$coefficients["finannot$drugscore", "Estimate"]
## [1] 0.3312194
# Code to get p-values:
summary(mylm2)$coefficients[, "Pr(>|t|)"]
##                 (Intercept)          finannot$drugscore 
##                6.500499e-49                3.054078e-02 
##      finannot$cancertypeCNS    finannot$cancertypeColon 
##                2.426703e-01                6.125147e-02 
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma 
##                5.701025e-01                3.131324e-02 
##    finannot$cancertypeNSCLC  finannot$cancertypeOvarian 
##                4.177617e-01                5.054287e-01 
## finannot$cancertypeProstate    finannot$cancertypeRenal 
##                6.953627e-01                2.822152e-01
summary(mylm2)$coefficients["finannot$drugscore", "Pr(>|t|)"]
## [1] 0.03054078

Running an ANOVA/ANCOVA, looking for the association between a gene and drug category.

We can use the same gene as we did above here. Let’s run an ANOVA to look for the association between the gene abundance level, and the drug category

myanova <- aov(mygene ~ finannot$drugcateg)
summary(myanova)
##                    Df Sum Sq Mean Sq F value Pr(>F)  
## finannot$drugcateg  2  1.431  0.7155   4.129 0.0215 *
## Residuals          54  9.358  0.1733                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Here's how to get the p-value:
summary(myanova)[[1]]["finannot$drugcateg", "Pr(>F)"]
## [1] 0.02145496

Now, we know that there is indeed a difference in expression value of the gene associated with drug category. But we have 3 drug categories, so where is the difference? Between Resistant and No Response? Between Sensitive and No Response? Or between Sensitive and Resistant? For this, we are going to calculate the Tukey Honest Significance Differences, which takes into account multiple comparisons.

mytukey <- TukeyHSD(myanova)

# To get the p-values for each group (you would replace 'finannot$drugcateg'
# by the name of your category:
mytukey$`finannot$drugcateg`[, "p adj"]
## Resistant-No Response Sensitive-No Response   Sensitive-Resistant 
##            0.05382105            0.69900743            0.03107014

We know that cancer type could be a confounding variable that could impact the association of the gene and the drug category, let’s run an ANCOVA and adjust for cancer type.

myanova2 <- aov(mygene ~ finannot$drugcateg + finannot$cancertype)
summary(myanova2)
##                     Df Sum Sq Mean Sq F value Pr(>F)  
## finannot$drugcateg   2  1.431  0.7155   4.973 0.0111 *
## finannot$cancertype  8  2.741  0.3426   2.381 0.0308 *
## Residuals           46  6.618  0.1439                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Here's how to get the p-value:
summary(myanova2)[[1]]["finannot$drugcateg", "Pr(>F)"]
## [1] 0.0110821
# Now Tukey:
mytukey2 <- TukeyHSD(myanova)

# To get the p-values for each group (you would replace 'finannot$drugcateg'
# by the name of your category:
mytukey2$`finannot$drugcateg`[, "p adj"]
## Resistant-No Response Sensitive-No Response   Sensitive-Resistant 
##            0.05382105            0.69900743            0.03107014

How does these results compare to the linear models that use drug as a continuous variable, not a categorical variable?

Logistic Regression

In some cases, you may have a dichotomous variable, and you wish to see the association of this dichotomous variable with an outcome.

In our case, we could dichotomize the gene levels into two categories, “hi” and “low”, where “hi” cell lines have abundance values greater than the mean. Let’s give it a try:

genemean <- mean(mygene)
genecateg <- mygene
genecateg[which(mygene < genemean)] = "low"
genecateg[which(mygene >= genemean)] = "hi"
table(genecateg)
## genecateg
##  hi low 
##  29  28

Important: you need to make sure that your categories are factors or else your model won’t work. The reference sample is determined by alpha-numerical order (in this case, “hi”). There are ways to change this though:

genecateg <- as.factor(genecateg)
levels(genecateg)
## [1] "hi"  "low"
genecateg <- relevel(genecateg, ref = "low")
levels(genecateg)
## [1] "low" "hi"

We can use the glm() function (generalized linear models function) again but specify that we want to run a logistic model:

mylogit <- glm(genecateg ~ finannot$drugscore, family = binomial(link = "logit"))
mylogit
## 
## Call:  glm(formula = genecateg ~ finannot$drugscore, family = binomial(link = "logit"))
## 
## Coefficients:
##        (Intercept)  finannot$drugscore  
##             0.1078              1.2136  
## 
## Degrees of Freedom: 56 Total (i.e. Null);  55 Residual
## Null Deviance:       79 
## Residual Deviance: 75.38     AIC: 79.38
summary(mylogit)
## 
## Call:
## glm(formula = genecateg ~ finannot$drugscore, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6649  -1.1023   0.8004   1.0886   1.5252  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          0.1078     0.2767   0.390   0.6968  
## finannot$drugscore   1.2136     0.6657   1.823   0.0683 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.001  on 56  degrees of freedom
## Residual deviance: 75.376  on 55  degrees of freedom
## AIC: 79.376
## 
## Number of Fisher Scoring iterations: 4
plot(mylogit)

# Code to get coefficients:
summary(mylogit)$coefficients[, "Estimate"]
##        (Intercept) finannot$drugscore 
##          0.1078333          1.2136056
summary(mylogit)$coefficients["finannot$drugscore", "Estimate"]
## [1] 1.213606
# Code to get p-values:
summary(mylogit)$coefficients[, "Pr(>|z|)"]
##        (Intercept) finannot$drugscore 
##         0.69676697         0.06831394
summary(mylogit)$coefficients["finannot$drugscore", "Pr(>|z|)"]
## [1] 0.06831394

Again, we can adjust for the effect of cancer type:

mylogit2 <- glm(genecateg ~ finannot$drugscore + finannot$cancertype, family = binomial(link = "logit"))
mylogit2
## 
## Call:  glm(formula = genecateg ~ finannot$drugscore + finannot$cancertype, 
##     family = binomial(link = "logit"))
## 
## Coefficients:
##                 (Intercept)           finannot$drugscore  
##                     -0.3864                       0.3030  
##      finannot$cancertypeCNS     finannot$cancertypeColon  
##                     -0.2624                       1.2725  
## finannot$cancertypeLeukemia  finannot$cancertypeMelanoma  
##                      1.7940                       2.3425  
##    finannot$cancertypeNSCLC   finannot$cancertypeOvarian  
##                      0.2282                       0.2095  
## finannot$cancertypeProstate     finannot$cancertypeRenal  
##                      0.4733                     -17.1146  
## 
## Degrees of Freedom: 56 Total (i.e. Null);  47 Residual
## Null Deviance:       79 
## Residual Deviance: 58.77     AIC: 78.77
summary(mylogit2)
## 
## Call:
## glm(formula = genecateg ~ finannot$drugscore + finannot$cancertype, 
##     family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1182  -0.9881   0.4923   0.8240   1.5632  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                   -0.3864     0.9156  -0.422    0.673  
## finannot$drugscore             0.3030     0.9143   0.331    0.740  
## finannot$cancertypeCNS        -0.2624     1.2618  -0.208    0.835  
## finannot$cancertypeColon       1.2725     1.2475   1.020    0.308  
## finannot$cancertypeLeukemia    1.7940     1.5733   1.140    0.254  
## finannot$cancertypeMelanoma    2.3425     1.4071   1.665    0.096 .
## finannot$cancertypeNSCLC       0.2282     1.1429   0.200    0.842  
## finannot$cancertypeOvarian     0.2095     1.2234   0.171    0.864  
## finannot$cancertypeProstate    0.4733     1.6971   0.279    0.780  
## finannot$cancertypeRenal     -17.1146  1494.1310  -0.011    0.991  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.001  on 56  degrees of freedom
## Residual deviance: 58.767  on 47  degrees of freedom
## AIC: 78.767
## 
## Number of Fisher Scoring iterations: 16
plot(mylogit2)

# Code to get coefficients:
summary(mylogit2)$coefficients[, "Estimate"]
##                 (Intercept)          finannot$drugscore 
##                  -0.3863637                   0.3030245 
##      finannot$cancertypeCNS    finannot$cancertypeColon 
##                  -0.2623841                   1.2725390 
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma 
##                   1.7940351                   2.3424781 
##    finannot$cancertypeNSCLC  finannot$cancertypeOvarian 
##                   0.2281633                   0.2095381 
## finannot$cancertypeProstate    finannot$cancertypeRenal 
##                   0.4732513                 -17.1145544
summary(mylogit2)$coefficients["finannot$drugscore", "Estimate"]
## [1] 0.3030245
# Code to get p-values:
summary(mylogit2)$coefficients[, "Pr(>|z|)"]
##                 (Intercept)          finannot$drugscore 
##                  0.67303616                  0.74031070 
##      finannot$cancertypeCNS    finannot$cancertypeColon 
##                  0.83526891                  0.30768899 
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma 
##                  0.25416655                  0.09596663 
##    finannot$cancertypeNSCLC  finannot$cancertypeOvarian 
##                  0.84177172                  0.86400264 
## finannot$cancertypeProstate    finannot$cancertypeRenal 
##                  0.78035100                  0.99086081
summary(mylogit2)$coefficients["finannot$drugscore", "Pr(>|z|)"]
## [1] 0.7403107

Is this what you would expect? Is it better to dichotomize your outcome or keep the values as continuous?

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    formatR_1.7     tools_3.6.1    
##  [5] htmltools_0.3.6 yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3  
##  [9] rmarkdown_1.14  knitr_1.23      stringr_1.4.0   xfun_0.8       
## [13] digest_0.6.20   evaluate_0.14